Tidyverse: Data Wrangling 101

0. Getting started

0.1 Target audience

This course is targeted at Master’s and Ph.D. students who have a basic understanding of the R programming language, but want to manipulate (large) datasets in R instead of using spreadsheet software. Being comfortable with basic R jargon (e.g., vectors, functions, objects, …) and operations is strongly recommended.

0.3 Prerequisites

A basic understanding of R is assumed. As we have a lot of ground to cover, it would be unwise to jump in unprepared! If you are not yet comfortable with R basics (loading a package, importing data, creating vectors and dataframes), I strongly recommend the swirl package, which interactively introduces you to R (see [swirl’s website](https://swirlstats.com/students.html) for more information). To get started with swirl right away, install the package using install.packages("swirl"), load it into your session with library(swirl) and jump-start your journey with swirl(). Going through the first chapter (1: Basic Building Blocks) should suffice, but don’t let that stop you from learning moRe!

0.4 Overview

For an overview of all sections covered within this material, please refer to the sidebar of this document or use the hyperlinks shown below. Sections 1 to 4 are not required for working with tidyverse, but are recommended to expand your understanding of why these packages work the way they do. Additionally, you’ll learn how to deal with importing data, including a section on larger datasets. Starting from section 5, we will get our hands dirty with actual tidyverse data manipulation.

1. Introduction

1.1 Context

As a biology student, I was introduced to R in the very first year of the programme. With R being my first scripting language, it was as much an uphill struggle as any other new language. In the second year, R was thrown on the table again in the context of statistics, with another round of RStats in the Master’s programme two years later. In this time, I used R only as a means to perform statistical tests. As real, raw data was rarely in the format presented during any of the statistical courses, I cleaned, filtered, pivoted, … all of it using MS Excel. If you have ever done this yourself, you know it can be a very time- and energy-consuming endeavour! Indeed, we never really learned how to clean and wrangle our datasets, leading to a lot of trouble and frustration when it came to data analysis.

During my thesis, however, I found out about ‘Tidyverse’, but never truly immersed myself in it. At the start of my Ph.D. in October 2020, I seized the moment to learn the ropes of this set of packages, and learned more about R along the way. To potentially save you a lot of time and trouble, whether you are a Master’s or Ph.D. student, or even something beyond that, I want to share with you some of the things I have learned along the way. For the record, I’m far from an expert on the matter, and there is still a lot left to explore!

This origin story aside, hopefully this material will prove to be helpful somewhere along your data journey. There are many ways to deal with data tidying and wrangling, and the tidyverse just happens to be one of them. Feel free to send any and all feedback you may have to .

1.2 Tidyverse

The tidyverse is “an opinionated collection of R packages designed for data science”. In that sense, tidyverse can be represented as a virtual basket containing different packages, which “all share an underlying design philosophy, grammar, and data structures”. In other words, these packages and their corresponding functions easily interact with each other, allowing for a wide range of tools to tinker with data.

If you haven’t already, install the tidyverse (install.packages("tidyverse")) and load it into your R environment.

  # load tidyverse
#install.packages("tidyverse")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

As shown above, a series of packages will be installed. Each of these packages is listed below, along with a brief description borrowed from the packages’ documentation. In addition to these ‘core’ packages, other R libraries are also installed along with tidyverse, but are mostly beyond the scope of this workshop.

  • ggplot2: A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.
  • dplyr: dplyr provides a grammar of data manipulation, yielding a consistent set of verbs that solve the most common data manipulation challenges.
  • tidyr: tidyr provides a set of functions that help you acquire tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable. This is part of the core philosophy of tidy data.
  • readr: readr provides a fast and friendly way to read rectangular data (like .csv, .tsv, and .fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
  • purrr: purrr enhances R’s functional programming toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.
  • tibble: tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more, forcing you to confront problems earlier, typically leading to cleaner, more expressive code.
  • stringr: stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.
  • forcats: forcats provides a suite of useful tools that solve common problems with factors. R uses factors to handle categorical variables: variables that have a fixed and known set of possible values.
  • Others: broom, cli, crayon, dbplyr, dtplyr, googledrive, googlesheets4, haven, hms, httr, jsonlite, lubridate, magrittr, modelr, pillar, readxl, reprex, rlang, rstudioapi, rvest, xml2

If you have already installed tidyverse earlier, you may want to check whether all packages contained within are up-to-date.

  # check for updates
tidyverse::tidyverse_update()

If a package is out of date, you will receive a notification and instructions to update outdated packages.

Try `tidyverse_packages(include_self = TRUE)` and see for yourself!

1.3 Package conflicts: masking

As shown in 1.2 Tidyverse, library(tidyverse) attaches multiple packages to your R session. Additionally, a couple of so-called Conflicts will be shown. As these conflicts are not exclusive to tidyverse, but become apparent when you start loading packages into R, it is important to know what exactly these conflicts entail.

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

In plain English, the functions filter() and lag() from the dplyr package share their names with filter() and lag() from the stats package (included with any R installation). In other words, once dplyr has been attached to your R session, filter() and lag() from the stats package will no longer be directly accessible (unless called explicitly using e.g. stats::filter()). The :: in dplyr::filter() indicates that filter() originates from the namespace of dplyr (“the space in which all names, belonging to a package, reside”). More specifically, :: allows accessing a specific package’s functions without loading the entire package into R (see ?'::' for more details).
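
To see the :: operator in action, the small sketch below uses only the stats package (part of every R installation, so it runs without tidyverse). Even if dplyr were attached and masking filter(), the explicit stats::filter() call would still reach the right function:

```r
# a simple time series; 'stats' ships with base R
x <- ts(1:10)

# call stats::filter() explicitly (here: a 3-point moving average),
# regardless of any masking by dplyr
smoothed <- stats::filter(x, rep(1/3, 3))
smoothed
```

Note that stats::filter() smooths or lags time series, which has nothing to do with dplyr::filter()’s row-subsetting; this is exactly why masking can be so confusing.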

If you end up working with many different packages, you will need to take such conflicts into account. One error I have come across very often occurs when using dplyr and raster together: if I load raster after a tidyverse library, raster::select masks dplyr::select.

  # load packages
library(tidyverse)
library(raster)

Loading required package: sp

Attaching package: ‘raster’

The following object is masked from ‘package:dplyr’:

    select

The following object is masked from ‘package:tidyr’:

    extract

  # select 'artist' and 'track' columns
billboard %>% 
  select(artist, track)

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘select’ for signature ‘"spec_tbl_df"’

Even though we were already warned that dplyr::select was masked by raster::select, we still tried using the select() function as if it were called from the dplyr package. Instead, select() was called from the raster package (which is aimed at geometrically subsetting raster or spatial objects), triggering the error (as select from the raster package cannot deal with the provided data object).

Another way to find out which namespace a function is called from, is by entering a function’s name (without brackets ()) in your R console. In case of the code above, the following message would be printed for select:

standardGeneric for "select" defined from package "raster"

function (x, ...) 
standardGeneric("select")
<bytecode: 0x000002a8f658cda0>
<environment: 0x000002a8f6968a48>
Methods may be defined for arguments: x
Use  showMethods(select)  for currently available ones.

2. The whole game

2.1 In a nutshell

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
Hadley Wickham, a play on Tolstoy’s “Happy families are all alike; every unhappy family is unhappy in its own way.”

More often than not, raw data will not be formatted in a very accessible, analysis-friendly way. Experiments generally do not produce clean trees.csv or covid_cases.csv files, but datasets that are wild, exotic or downright savage. This is especially likely if someone else collected data for you, but did not have any prior knowledge about your general set-up. If you cannot recall hours of tedious data tinkering in Excel following a group lab practical, then have you really lived your student life to the fullest? ;-)

If only Robbie would stop bothering Alexa… Source: [Jon Carter](https://www.kdnuggets.com/2017/09/cartoon-machine-learning-class.html)

To deal with such datasets, rigorous cleaning and wrangling is required before even thinking about modelling or visualizing the story residing within your data. The schematic below (as shown in R for Data Science; great reference material!) encompasses the entire process of data science with tidyverse; this course is (mostly) limited to the parts highlighted in blue. Most importantly, data tidying and transformation will be taking center stage, with some notes on importing data, and dipping our toes in visualization at the end.

Data wrangling. Source: [R for Data Science](https://r4ds.had.co.nz/wrangle-intro.html)

2.2 Data cleaning vs data wrangling

The above schematic in words: Raw data first needs to be imported by loading it into the R environment (which usually means effectively loading data into memory (RAM)). Once loaded, data often requires tidying (or data cleaning) and transformation (or data wrangling) prior to any analyses down the line. In more exact terms, data cleaning is the process through which errors are fixed and data quality is ensured, while data wrangling would be defined as the process through which raw data is manipulated and transformed. This is only a matter of semantics, and will not impact the flow of this workshop.

3. Tibbles and pipes

Before diving head first into the tidyverse, we will need to talk about tibbles and pipes. Many functions within the tidyverse return tibbles, making them at least worth mentioning. Pipes, on the other hand, are a powerful tool for clearly expressing a sequence of multiple operations.

3.1 Tibbles

Tibbles are dataframes, but with a twist. For the sake of comparison and clarification, we will create a classical R data frame and a tibble with the same content.

  # set seed for reproducibility
set.seed(1)
  
  # create a dataframe
a_dataframe <- data.frame(x = 1:25,
                          y = rnorm(25, 1, 2))

  # create a tibble from the dataframe
a_tibble <- as_tibble(a_dataframe)

Here, we have created two datasets, each containing the same information: column x with numbers from 1 to 25, and column y with 25 random observations drawn from a normal distribution. Both have the same number of variables (2) and observations (25), and will produce the same results (try running e.g. all.equal(mean(a_dataframe$y), mean(a_tibble$y))). This raises the question: what, exactly, is different?

One hint toward the answer can be obtained by printing both objects (simply entering a_dataframe and a_tibble in the R console) and reviewing the output.

A regular dataframe (left) and a tibble (right). The tibble shows a couple of distinct features to improve printing and inspection of your data.

As shown in the figure above, a couple of features are shown in a tibble that are non-existent for ‘regular’ dataframes. In a way, a tibble is nothing more than a data frame with some extra ‘quality of life’ features. However, there are other important differences going on under the hood, encapsulating best practices for data frames. In most cases, tibbles and dataframes can be used interchangeably, though some packages/functions will return an error as they do not recognize tibbles as dataframes with a cherry on top. Read more on tibbles here.
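
To check this ‘cherry on top’ for yourself, the minimal sketch below (assuming the tibble package is installed) inspects a tibble’s classes:

```r
library(tibble)

a_tibble <- tibble(x = 1:3, y = c("a", "b", "c"))

# a tibble carries two extra classes on top of "data.frame"
class(a_tibble)

# ...which is why it still *is* a data frame to most functions
is.data.frame(a_tibble)
```

Because "data.frame" remains in the class vector, most data frame code accepts tibbles as-is; the problems mentioned above arise in packages that test classes too strictly.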

3.2a Pipes: short version

A pipe is a special operator aimed at making code more intuitive to read and write (though opinions may differ), and is not exclusive to R. The tidyverse pipe is written as %>% (CTRL+SHIFT+M in RStudio) and originates from the magrittr package (to be pronounced with a sophisticated French accent). It is automatically loaded with library(tidyverse) or library(dplyr), but can also be loaded separately using library(magrittr) or library(magrittr, include.only = "%>%").

In brief, the magrittr pipe passes the output of what comes before the pipe (left-hand side) as input to the function after the pipe (right-hand side). The pseudo-code below shows what this looks like in the context of baking cookies in a factory, going through the functions and pipes as if it were a conveyor belt or pipeline.

raw_ingredients <- c("butter", "sugar", "eggs", "chocolate chips", "...")

choc_chip_cookies <- raw_ingredients %>% 
  make_dough() %>% 
  shape_cookies() %>% 
  transport_to_oven() %>% 
  bake_yummie_cookies() %>% 
  cool_cookies() %>% 
  pack() %>% 
  send_away()

In case you don’t completely understand the %>% yet, I’ve written a more lengthy section below (3.2b Pipes: long version). It will help you to better grasp the benefits of the operator, but is not required to get you going with tidyverse. The bottom line remains: %>% passes what comes before to what comes after the pipe, effectively creating a virtual pipeline of consecutive operations.
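
For a tiny real (non-cookie) example, the sketch below (assuming magrittr is installed; %>% is also attached by library(tidyverse) or library(dplyr)) passes a vector through two functions in sequence:

```r
library(magrittr)

# take the square roots first, then sum them up
total <- c(1, 4, 9) %>% 
  sqrt() %>% 
  sum()

total  # 1 + 2 + 3 = 6
```

Read it left to right, top to bottom: the vector flows into sqrt(), and sqrt()’s output flows into sum().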

NOTE: As of R version 4.1.0, R also sports a native pipe operator |>. Its behaviour is highly similar to %>%, but is now part of the R language itself, while %>% needs to be imported from a package. For further reading, head on over to following comparison. By default, RStudio may resort to the %>% pipe (as RStudio and tidyverse are developed by the same team). This behaviour can be changed in the settings of RStudio via Tools > Global options > Code > Editing. The |> operator will not be covered in this course, though I have grown to like it slightly more than %>% (mostly aesthetically :-)).

3.2b Pipes: long version

To give you a working example, I will provide some code (see below) with one of the most used built-in R datasets: mtcars. Don’t worry too much about the different functions used (we will get back to most of those later!), but pay attention to the use of the %>% operator in example 1.

  # example 1: piping
df_cars <- mtcars %>% 
  rownames_to_column("car") %>% 
  filter(str_detect(car, "Merc")) %>% 
    # convert miles per gallon -> km per liter
  mutate(kml = mpg*(1.60934/3.78541)) %>% 
  select(car, kml, cyl, hp)

As the benefit of the %>% will not immediately become clear following this example, consider the following code blocks (examples 2 to 4):

  # example 2: nesting
df_cars <- select(
  mutate(
    filter(rownames_to_column(mtcars, "car"), str_detect(car, "Merc")),
    kml = mpg*(1.60934/3.78541)), 
  car, kml, cyl, hp)
  # example 3: overwriting
df_cars <- rownames_to_column(mtcars, "car")
df_cars <- filter(df_cars, str_detect(car, "Merc"))
df_cars <- mutate(df_cars, kml = mpg*(1.60934/3.78541))
df_cars <- select(df_cars, car, kml, cyl, hp)
  # example 4: base R
df_cars <- mtcars
df_cars$car <- row.names(df_cars)
df_cars <- df_cars[, c(ncol(df_cars), 1:(ncol(df_cars) - 1))]
df_cars <- df_cars[grep("Merc", df_cars$car), ]
row.names(df_cars) <- NULL
df_cars$kml <- df_cars$mpg*(1.60934/3.78541)
df_cars <- df_cars[c("car", "kml", "cyl", "hp")]

All of the above examples will yield the same df_cars object at the very end. In my humble opinion, example 1 is the most readable and maintenance-friendly code by far (but, again, mileage and opinions may vary). Once you become used to piping multiple operations together into one chain, you no longer need to intermediately save or overwrite old data (with some exceptions that are bound to cross your path). Additionally, code becomes more intuitive and readable if used correctly.

Another thing you may (or may not) have noticed, is how in example 1 (as well as in example 2, but for different reasons) df_cars is only mentioned once, while example 3 and 4 mention the object df_cars 7 and 15(!!) times, respectively. As for example 2, nesting all of the functions limited the number of mentions of df_cars, at the cost of readability. Naturally, one could also nest some of the operations in example 4, but I think we all have better things to do!

In any case, where is each function in example 1 getting their data from, or how does it know which one to use? And what is the order of execution of each of these function calls? To explain this, imagine a factory that produces your favourite type of cookie - I will go with the classic ol’ chocolate chip. At one end, raw ingredients (butter, sugar, eggs, chocolate chips, …) are delivered to the factory’s doorstep. At the other end, the factory pumps out boxes chock-full of delicious cookies.

Of course, we all know the factory isn’t a black box, but rather an intricate system of many different steps. First, raw ingredients need to be mixed into a batter and thickened into a dough. Next, this dough is poured into moulding machines where the cookies are given their iconic shape. Then, the cookie-shaped dough moves down a conveyor belt to an industrial oven for baking. Finally, after cooling these heavenly cookies, they are packed and sent away. Let’s write this into some pseudo-code using the magrittr pipe:

raw_ingredients <- c("butter", "sugar", "eggs", "chocolate chips", "...")

choc_chip_cookies <- raw_ingredients %>% 
  make_dough() %>% 
  shape_cookies() %>% 
  transport_to_oven() %>% 
  bake_yummie_cookies() %>% 
  cool_cookies() %>% 
  pack() %>% 
  send_away()

In case you hadn’t noticed yet, the pipe passes the output from the function before it, to the function after the pipe (in the example above, it passes the output from the first line to the next line as the input). As such, raw_ingredients is passed on to make_dough(), and the result of make_dough() is passed on to shape_cookies(). Once the data has gone through send_away(), it is stored in the object called choc_chip_cookies - the finished box of cookies, if you will. In terms of the code above, you could also write each of the cookie-making steps on one line of code (like a virtual conveyor belt), but this would make the code much less readable (head on over to this style guide for more info on code styling within tidyverse).

Without going into too much detail, this behaviour is also ingrained in most tidyverse functions: they use the output of whatever comes before the pipe as the input for the operation after the pipe. If you want to explicitly refer to this input within these function calls, the dot (.) placeholder can be used:

raw_ingredients <- c("butter", "sugar", "eggs", "chocolate chips", "...")

choc_chip_cookies <- raw_ingredients %>% 
  make_dough(.) %>% 
  shape_cookies(.) %>% 
  transport_to_oven(.) %>% 
  bake_yummie_cookies(.) %>% 
  cool_cookies(.) %>% 
  pack(.) %>% 
  send_away(.)

For now, this is all you need to know about pipes (and far more than I knew when I started out). In brief, the magrittr pipe passes the output of what comes before the pipe (left-hand side) as input to the function after the pipe (right-hand side). For more technical information, see here.


*'Ceci n'est pas une pipe'*, by Belgian artist René Magritte, which served as the etymological inspiration for the [*magrittr*](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html) package.

3.3 Tidy data

As already touched upon in the description of the tidyr package, tidy data is data with a consistent form and follows three rules:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

The three rules of tidy data. Source: [R for Data Science](https://r4ds.had.co.nz/tidy-data.html)

If data is tidy, then every variable goes in a column, and every column is a variable. Tidy data is not desirable in all cases, but can prove to be a very robust way of structuring data when using tidyverse. Those who have already worked with ggplot2 may know what I am talking about! For more information and examples, check 12.2 Tidy data in R4DS.
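
To make the three rules concrete, here is a small, invented example (both data frames are hypothetical, built with base R only). The first spreads values of one variable across year columns; the second follows the rules above:

```r
# untidy: the 'cases' variable is spread over one column per year
untidy <- data.frame(country = c("A", "B"),
                     `2019`  = c(10, 20),
                     `2020`  = c(12, 25),
                     check.names = FALSE)

# tidy: one row per observation (country x year), one column per variable
tidy <- data.frame(country = rep(c("A", "B"), each = 2),
                   year    = rep(c(2019, 2020), times = 2),
                   cases   = c(10, 12, 20, 25))
```

In the untidy version, ‘year’ is not a column at all but lives in the column names; tidyr functions such as pivot_longer() exist precisely to perform this kind of reshaping automatically.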

4. Importing data

4.1 readr and base R

As this workshop is more about the actual data cleaning and wrangling, I will only go over importing data very briefly.

Within tidyverse, the readr package was developed to provide a fast and friendly way to read rectangular data (e.g. csv). The readr functions you’re likely to use most often are:

  • read_csv() to read comma (,) delimited files;
  • read_csv2() to read semicolon (;) separated files (a common file type in Belgium, where , is used as the decimal point);
  • read_tsv() to read tab delimited files;
  • read_delim() to read files with any delimiter.

As you may already know, base R also has tools to import data (e.g. read.csv(), read.csv2(), read.table(), …), but these are generally slower than those provided by readr. Regardless, this is unlikely to be an issue unless you are working with larger datasets (>1 million observations).
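
As a minimal, self-contained sketch (assuming readr ≥ 2.0 for the show_col_types argument), the code below writes a tiny comma-delimited file to a temporary location and reads it back with read_csv(); the file contents are made up for illustration:

```r
library(readr)

# write a tiny .csv file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,height", "Luke,172", "Leia,150"), tmp)

# read it back in; show_col_types = FALSE silences the column spec message
df <- read_csv(tmp, show_col_types = FALSE)
df
```

Note that read_csv() returns a tibble and guesses column types for you (here: character for name, double for height).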

4.2 Big data

In case your workflow is suffering from large data files, fear not - there are many powerful tools at your disposal! Introducing vroom and data.table. Both packages use multithreading, which is very beneficial if your computer possesses multiple CPU cores (which is often the case nowadays). This, along with some other nifty features, allows reading and writing data very fast (one could say vrrroooom). As opposed to data.table, vroom does not fully read data into memory, but only indexes it. This means that only the columns and rows you actually use are read.

That said, data.table still tends to be faster than vroom for numeric data. On top of that, data.table provides tools and syntax to wrangle data much more efficiently than e.g. tidyverse functions; it is also faster and more memory-efficient in doing so, but its syntax is more difficult to read and write. As a compromise, dtplyr has been called into existence within the tidyverse, which uses the same ‘tidy verbs’ you’ll become familiar with, but translates them into data.table syntax to benefit from its sheer speed (with some minor loss of speed due to overhead, and loss of memory-efficiency). One additional thing I want to mention about data.table is its very convenient and fast way of reading (fread) and writing (fwrite) data. fread is highly similar to base R’s read.table(), but automatically detects column separators and data types, and has many arguments to customize the function call.
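
A quick round-trip sketch of fread() and fwrite() (assuming the data.table package is installed; the table itself is made up for illustration):

```r
library(data.table)

# write a small table to a temporary .csv file
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(x = 1:3, y = c("a", "b", "c")), tmp)

# fread() detects the separator and column types automatically
dt <- fread(tmp)
dt
```

The same two functions scale to multi-gigabyte files, which is where their multithreading really starts to pay off.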

Nevertheless, as of readr 2.0.0, the package uses vroom as a backend, granting an impressive speed boost. Below, I provide a benchmark comparing base R’s read.csv(), the previous (*.old*) and current (*.new*) versions of read_csv(), as well as data.table::fread(), to import a .csv file containing 16 columns and >34 million rows of mostly numerical data (total size: 3.56 GB, uncompressed).

Unit: seconds
                 expr        min         lq       mean     median         uq        max neval
      base.read.csv() 111.372018 113.808849 114.516900 113.859472 116.155927 117.388234     5
 readr.read_csv.old()  37.959150  38.016099  38.741874  38.960399  39.156582  39.617139     5
 readr.read_csv.new()   4.326258   4.599566   4.941215   4.649826   4.681539   6.448888     5
           dt.fread()   3.465170   3.667079   3.716254   3.742631   3.815739   3.890653     5

There are ways to optimize functions (such as read.table()) to handle data more efficiently, or by parallelizing operations, but these are very situational and far beyond the scope of this workshop.

To find out more about vroom and data.table, click on the embedded links.

5. Data manipulation with dplyr

Once you’ve loaded your data into R, it’s time to start tinkering! A package that aims to make handling and manipulating data easier and more efficient is dplyr (pronounced d-ply-r), which you’ll use most often when it comes to data manipulation. It presents itself as a grammar of data manipulation, providing a set of functions, also referred to as verbs, that solve the most common data manipulation challenges: select(), filter(), mutate(), arrange(), etc.

These so-called verbs can be used together harmoniously, aiming to make data wrangling a much smoother and more readable experience. For your reference, I will list the functions used in 5.10 Overview, each accompanied by a short description. Worked-out examples will be provided in the next sections. Similar examples can be found in the package’s vignette by using vignette("dplyr") in R. For the sake of completeness, I will provide some base R alternatives as well (dplyr<->base R).

Using one of the datasets included with dplyr, we will explore the world of Star Wars. As you can see below, we have 14 variables (columns) with 87 observations (rows), so we have a lot to work with! The glimpse() function is included in the dplyr package (but is actually exported from tibble, which, in turn, exported it from pillar - phew!), and gives us a ‘glimpse’ of the dataset.

glimpse(starwars, width = 75)
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Le…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 1…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brow…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "ligh…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "b…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 5…
$ sex        <chr> "male", "none", "none", "male", "female", "male", "fem…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "f…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", …
$ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "R…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>,…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced…

5.1 select()

Often, datasets will contain a lot of information you don’t need. For this purpose, it can be useful to keep only the columns you are using in further analyses.

To select a column, simply type the name(s) of the column(s) you want to keep. Conversely, the same can be achieved by writing the column name(s) you don’t need, preceded by the - operator. Compare the pieces of code below, as well as the results.

  # select columns name, height, mass, homeworld and species
starwars %>% 
  select(name, height, mass, homeworld, species)

starwars %>% 
  select(-hair_color, -skin_color, -eye_color, -birth_year, -sex, -gender, -films, -vehicles, -starships)

  # base R solutions
starwars[c("name", "height", "mass", "homeworld", "species")]
subset(starwars, select = c(name, height, mass, homeworld, species))

You may be wondering whether there aren’t any more efficient ways to select columns, rather than typing out each individual one. Fortunately, there are multiple ways to achieve the same result! In a lot of cases, the : operator can be used to select adjacent columns (FROM:TO), as shown below.

starwars %>% 
  select(name:mass, homeworld, species)

starwars %>% 
  select(-c(hair_color:gender, films:starships))

You can even select columns without ever mentioning any (full) column names, using one of the selection helpers; see ?tidyselect::language for a list of all tidyselect helper functions. Below, I will show where() and contains().

  # select all numeric columns
starwars %>% 
  select(where(is.numeric))

  # select all columns that contain the letter 'o'
starwars %>% 
  select(contains("o"))

Now that we know how to use select(), let’s apply it to perform some cleaning! Because the columns films, vehicles and starships are list-columns, we will omit them from any further manipulations.

starwars <- select(starwars, -c(films, vehicles, starships))

5.2 filter()

Filtering can be performed for many reasons. Perhaps you are only interested in Star Wars characters with a certain hair or skin colour, a certain species, or characters who are at least as tall as the expected average height of a healthy population as defined by the WHO growth reference standards. Whatever your flavour may be, all of this can be done using filter() to return only specific rows.

  # characters with a gold skin colour
starwars %>% 
  filter(skin_color == "gold")

  # base R solutions
starwars[starwars$skin_color == "gold", , drop = FALSE]
subset(starwars, skin_color == "gold")

If we want to take it a step further, we can also combine different statements together. Try running the following code:

  # return only characters that are:
    # masculine,
    # not with a golden skin colour, 
    # and at least 176.5 cm tall
starwars %>% 
  filter(gender == "masculine", 
         skin_color != "gold", 
         height >= 176.5)

Any of the base R relational operators (?Comparison) can be used in conjunction with filter(). By default, the , acts as the & operator within a filter() call.
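
For example (a small sketch of my own), the comma in a filter() call is interchangeable with an explicit &, while | expresses ‘or’:

```r
library(dplyr)

# identical to filter(gender == "masculine", height >= 176.5)
tall_masc <- starwars %>% 
  filter(gender == "masculine" & height >= 176.5)

# | expresses OR: characters that are very short OR very tall
extremes <- starwars %>% 
  filter(height < 100 | height > 220)
```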

We can also provide a vector to filter a column, for which the %in% operator is used. In conjunction with the ! operator (preceding the column name you are filtering on), you can also exclude parts of your dataset. Try running the code below and spot the differences.

  # characters with eye_color == "blue" OR eye_color == "red"
starwars %>% 
  filter(eye_color %in% c("blue", "red"))

  # characters NOT with eye_color == "blue" OR eye_color == "red"
starwars %>% 
  filter(!eye_color %in% c("blue", "red"))

However, there are also a couple of rows that contain multiple colours, as shown by unique(starwars$eye_color), meaning our current filtering methods are not completely watertight. Without going into detail, the stringr package (included with tidyverse) provides a solution to this issue (e.g. stringr::str_detect()). Run the following code to confirm that, indeed, all characters with red eyes (even those with multiple eye colours) are returned.

starwars %>% 
  filter(str_detect(string = eye_color, pattern = "red"))

Working with string vectors is a whole endeavour on its own and cannot be covered extensively within the scope of this course. If you are curious to know more about what is called “regular expression” (abbreviated to “regex”), feel free to head on over to stringr’s vignette for an introduction to regex using stringr.
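
Just to whet your appetite (the pattern below is only an illustration), regular expressions can match positions as well as literal text:

```r
library(dplyr)
library(stringr)

# ^ anchors the match at the start of the string, so "blue" and
# "blue-gray" match, but a colour merely containing "blue" would not
blue_start <- starwars %>% 
  filter(str_detect(eye_color, pattern = "^blue"))
```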

5.3 arrange()

While filter() selects or omits certain rows, arrange() simply reorders them (i.e. no rows are removed). By default, it arranges rows in an ascending order. In case of character columns, they are reordered alphabetically (A-Z), while numeric columns will be ordered from smallest to largest. For the record, arrange() reorders the entire dataframe according to the column you have selected.

  # reordering a character column from A to Z
starwars %>% 
  arrange(name)

  # reordering a numeric column from smallest to largest values
starwars %>% 
  arrange(height)

If we want to reorder rows in a descending order, then the helper function desc() can be used.

  # reordering a numeric column from largest to smallest values
starwars %>% 
  arrange(desc(height))

Furthermore, multiple columns can be used for reordering. For completeness’ sake, I will also provide the base R equivalent. While the first column is used to arrange the entire dataset, consecutive columns act as ‘tie-breakers’ (if at least two characters have the same skin colour, they will be sorted by species in the example below).

  # dplyr: arranging using multiple columns
starwars %>% 
  arrange(skin_color, species)

  # base R: arranging in ascending order using multiple columns
starwars[order(starwars$skin_color, starwars$species), , drop = FALSE]

5.5 mutate()

Whenever you want to create a new variable (whether it is created based on data in your dataframe or not), or you want to change the content of existing columns, mutate() is the function to use. Its syntax is simple, but very versatile. Let’s start with a classic dplyr calculation: the BMI of each Star Wars character. Looks like good ol’ Jabba really needs to watch his carbs!

  # calculate BMI and arrange from highest to lowest
sw_bmi <- starwars %>% 
  mutate(bmi = mass/(height/100)^2) %>% 
  select(name:mass, bmi) 

sw_bmi %>% 
  arrange(desc(bmi))

What’s cool about mutate() is that you can use newly created variables within the same mutate() call, as shown below. Let’s pretend the force is inversely related to a character’s BMI, scaled by their birth year. First, we calculate each character’s BMI, after which we use that newly created variable to obtain a character’s force level. The runner-up may come as a surprise…

  # calculate the force based on an individual’s BMI and birth year
starwars %>% 
    mutate(
      bmi = mass/(height/100)^2, 
      the_force = 1/bmi*birth_year) %>% 
    select(name, the_force) %>% 
    arrange(desc(the_force)) 

You can also mutate existing columns ‘in place’.

  # transform columns
starwars %>% 
  mutate(
    height = height/100, 
    mass = mass*.5,
    skin_color = factor(skin_color)
  )

A more advanced and concise way of modifying multiple columns at the same time combines across() and tidyselect helper functions. Let’s pretend, for one moment, that it would be interesting to multiply all numerical columns by 10. First, I’ll show a more verbose (wordy, long-winded) piece of code, followed by a tidy version.

  # verbose code
starwars %>% 
     mutate(across(where(is.numeric), function(x) {x*10}))

  # tidy code
starwars %>% 
     mutate(across(where(is.numeric), ~.x*10))

In the code block above, across() allows using select()-wise semantics to manipulate multiple columns at the same time. This can be done by defining specific columns, but also by using helper functions such as where(). On all of the selected columns, a function is applied. The first approach uses the classic R syntax (function(x) {do something}), but its verbosity can hamper the code’s readability at a glance. To this end, the purrr-style lambda approach (using ~ or ‘twiddle’) provides more concise syntax using shortcuts (e.g. .x as a placeholder for each column that is selected), which can make (short and simple) anonymous functions much more readable. The same logic applies for named functions, as shown below.

  # create function to multiply by 10
force10 <- function(x){
  x*10
}

  # apply function
starwars %>% 
  mutate(across(where(is.numeric), ~force10(.x)))

Creating new columns in base R isn’t all too difficult either, but it does not allow for a piped workflow as shown in the tidyverse examples. Additionally, a newly created variable cannot be used within the same function call that created it (as opposed to mutate()), unless the calls are nested.

transform(starwars, bmi = mass/(height/100)^2)
starwars$bmi <- starwars$mass/(starwars$height/100)^2
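
A sketch of that nesting workaround (sw and sw_force_base are names of my own choosing, and I subset to plain, non-list columns so base R’s transform() is happy):

```r
library(dplyr)

# plain data.frame without list-columns, to keep base R happy
sw <- as.data.frame(starwars[c("name", "height", "mass", "birth_year")])

# bmi does not exist yet inside the outer transform() call,
# so a nested transform() has to create it first
sw_force_base <- transform(
  transform(sw, bmi = mass/(height/100)^2),
  the_force = 1/bmi*birth_year
)
```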

5.6 group_by() and summarize()

Sometimes, one simply wants to summarize a dataset, e.g. to calculate the average height or weight of Star Wars characters. For this purpose, summarize() creates a new dataframe that summarizes all observations defined in the function’s input. Below, we calculate the average height and weight of Star Wars characters using the height and mass columns as input. Because we are also interested in the variation of the data used to calculate these metrics, we calculate the respective standard deviations using sd(). Looks like the average character weighs approximately 97 kg and is approximately 174 cm tall (although these metrics’ standard deviations may provide some additional context)!

Keep in mind that summarize() (which can also be typed as summarise(); many tidyverse functions have synonyms facilitating use in both British and American English!) only retains the summary columns defined within the function call!

starwars %>% 
  summarize(average_height = mean(height, na.rm = TRUE),
            stdev_height = sd(height, na.rm = TRUE),
            average_weight = mean(mass, na.rm = TRUE),
            stdev_weight = sd(mass, na.rm = TRUE))

We can up the ante by calculating the average height of characters per species. In other words, we want to summarize our data based on groups within that data. To tell R we want to perform such grouped operations, we use group_by(), subsequently calculating our summary data with summarize(). We will use an intermediate filter(n() >= 3) step, in which dplyr::n() returns the size of the current group, to include only species with at least 3 individuals in our dataset, and remove rows that contain NA in the species column using tidyr::drop_na().

starwars %>% 
  group_by(species) %>% 
  filter(n() >= 3) %>% 
  drop_na(species) %>% 
  summarize(average_height = mean(height, na.rm = TRUE), 
            average_weight = mean(mass, na.rm = TRUE), 
            individuals = n()) %>% 
  arrange(-average_height, -average_weight)

As you can see, grouping can be very powerful and versatile. In the example above, we group our dataset by species, filter it to keep only groups that contain at least 3 observations, and summarize() to calculate some averages and the size of our groups. This works because grouping can be used by most dplyr functions (mutate(), filter(), summarize(), …). Notice how the grouping variable (species) is retained in the resulting dataframe!
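
To illustrate that grouping also affects mutate(), here is a sketch that expresses each character’s height relative to their species’ average (the column name height_vs_species is made up for this example):

```r
library(dplyr)

# per-species deviation from the species' average height
sw_dev <- starwars %>% 
  group_by(species) %>% 
  mutate(height_vs_species = height - mean(height, na.rm = TRUE)) %>% 
  ungroup() %>% 
  select(name, species, height, height_vs_species)
```

Because mean() is evaluated per group, sole members of a species (such as Jabba, the only Hutt) end up with a deviation of exactly zero.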

However, with great grouping power comes great grouping responsibility! Cheesiness aside, errors and unpredictable behaviour can occur when one isn’t mindful of their groupings. That is, group_by() does not alter the data points themselves, but alters the structure of your data as a whole.

sw_subset <- starwars %>% select(name, height, mass, gender)

sw_subset_gr <- sw_subset %>% group_by(gender)

  # ungrouped data
glimpse(sw_subset, width = 75)
Rows: 87
Columns: 4
$ name   <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia O…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, …
$ mass   <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femin…

  # grouped data
glimpse(sw_subset_gr, width = 75)
Rows: 87
Columns: 4
Groups: gender [3]
$ name   <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia O…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, …
$ mass   <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femin…

Indeed, the glimpse() function (alternatively, you can use str() or simply print the dataframe to your console) shows an additional line of information for sw_subset_gr, which is not shown for the ungrouped data: Groups: gender [3]. This implies that grouping is applied to sw_subset_gr, meaning subsequent operations will take this into account whenever possible (such as the filter(n() >= 3) used earlier).

To remove grouping, either use summarize() in case you want to create a summary of your data, or ungroup() if you simply want to remove the grouping without any other manipulations.

My personal take on grouping and ungrouping is that I will always (try to) add ungroup() following the use of group_by(), even, if I have already used summarize(). Even though summarize() removes a layer of grouping, the additional use of ungroup() makes the operation explicit, clearly conveying that your functions operate on grouped or non-grouped data. The examples provided earlier do not conform with this notion for the sake of brevity.
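
In practice, that habit looks like the sketch below (species_height is a name of my own choosing):

```r
library(dplyr)
library(tidyr)

species_height <- starwars %>% 
  drop_na(species) %>% 
  group_by(species) %>% 
  summarize(average_height = mean(height, na.rm = TRUE)) %>% 
  ungroup()  # redundant after a single summarize(), but makes the intent explicit
```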

Bonus: count() provides a ‘quick and dirty’ convenience function so you can easily count e.g. the number of individuals that share the same species name, rather than going through separate group_by() and summarize() calls.

starwars %>% 
  count(species) %>% 
  arrange(-n)

instead of

starwars %>% 
  group_by(species) %>% 
  summarize(n = n()) %>% 
  arrange(-n)

The base R approaches to summarize, especially when grouped, can be quite complicated and are, therefore, not shown here. If you want to know some ways to perform these operations without any package dependencies, please refer to this GitHub Gist for inspiration.

5.7 slice()

The slice() function and its derivatives (slice_head(), slice_tail(), slice_max(), slice_min()) allow subsetting rows using their respective positions in a dataframe. Additionally, slice_sample() is used to obtain a random set of rows from a dataframe.

Let’s store a pre-wrangled dataset into an R object we can work with.

sw_force <- starwars %>% 
  mutate(
    bmi = mass/(height/100)^2, 
    the_force = 1/bmi*birth_year) %>% 
  select(name, the_force) %>% 
  arrange(name) %>% 
  drop_na(the_force)

First of all, slice() allows selecting or removing specific rows (based on their row number) from a dataframe. This can be useful if a dataset contains specific rows with weird or useless values (e.g. sentences or empty lines, as you would sometimes expect in spreadsheets containing multiple datasets in a single sheet).

  # select the first line of dataframe
sw_force %>% 
  slice(1)

  # select lines 1, 3 and 5
sw_force %>% 
  slice(1, 3, 5)

  # select lines 5 through 10
sw_force %>% 
  slice(5:10)

  # remove row 2 using the - operator
sw_force %>% 
  slice(-2)

Using slice_head() and slice_tail(), we can perform these operations on the n first or last rows.

  # 5 first rows
sw_force %>% 
  slice_head(n = 5)

  # 5 last rows
sw_force %>% 
  slice_tail(n = 5)

In case the force is strong with you, you may have noticed I didn’t arrange sw_force by the_force as before, but by name. In other words, all previous operations acted upon this specific ordering of data, and do not yield any information on characters with the highest or lowest the_force values. To obtain these data (without using arrange() first), slice_max() and slice_min() are called into action.

  # top 5 characters with highest the_force values
sw_force %>% 
  slice_max(order_by = the_force, n = 5)

  # top 5 characters with lowest the_force values
sw_force %>% 
  slice_min(order_by = the_force, n = 5)

Lastly, we could also be interested in sampling random rows from a dataframe, which can be done with slice_sample().

set.seed(1) # for reproducibility 
sw_force %>% 
  slice_sample(n = 5)

5.8 distinct()

Whenever you suspect you are dealing with duplicate data (which can occur for many reasons), distinct() can be a lifesaver, though it can be a little bit tricky if you’re not careful. At least in the Star Wars universe, clones are common practice.

First, I will create a resampled dataset from the starwars object to create duplicate rows. After showing that, indeed, we are dealing with proper clones, we will call distinct() to save the day.

  # create starwars clones
set.seed(1)
clone_wars <- starwars %>% 
  slice_sample(n = 100, replace = TRUE)

  # confirm duplicates using name
clone_wars %>% 
  count(name) %>% 
  filter(n > 1) %>% 
  arrange(desc(n))

  # keep only distinct rows
nonclone_wars <- clone_wars %>% 
  distinct()

  # confirm we only have 1 row per name
nonclone_wars %>% 
  count(name) %>% 
  filter(n > 1)

The last piece of code returns A tibble: 0 x 2, meaning none of the character names appears more than once in nonclone_wars - success!

By default, distinct() considers all columns if none are specified. In other words, it searches for rows that are unique across all columns in a dataframe. If there are two Chewbaccas, each with a different weight, then both Chewbaccas will be returned.
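
A tiny demonstration with invented data (the chewies tibble is made up for this example):

```r
library(dplyr)
library(tibble)

# hypothetical clones: same name, different mass
chewies <- tibble(
  name = c("Chewbacca", "Chewbacca"),
  mass = c(112, 120)
)

distinct(chewies)        # both rows kept: the full rows differ in mass
distinct(chewies, name)  # one row: only name is compared
```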

Alternatively, specifying particular columns will return rows that are unique in those columns only, as shown below. This can also be useful if you are looking for specific combinations of data, such as the starwars$*_color variables.

clone_wars %>% 
  distinct(name, height, mass)

clone_wars %>% 
  distinct(across(contains("color")))

5.9 joining dataframes

In many cases, data is spread across multiple datasets. While your measurement data is collected in one file, you may have some metadata lying around elsewhere. If you’re a well-prepared data scientist who systematically collects their data, then you will have included variables that are common among these datasets (e.g. names of Star Wars characters) and that uniquely identify each data point (otherwise, you’re in for a real treat…). These common variables are referred to as keys, which enable merging or joining data together into a single dataset for analysis.

There are two ways of joining tables: 5.9.1 mutating joins and 5.9.2 filtering joins. A mutating join adds new variables to one dataframe from matching observations in another. Conversely, a filtering join filters observations from one data frame based on whether they do/don’t match in the other table. For further reading, examples and visualizations, see R4DS: Relational data.

To show the difference between these join types, we will create two sets of data using tibble::tribble(). The movies object shows names of movies, the year in which they were released and the worldwide box office gross (in million dollars). The publishers dataset contains information on the publishing studios of a couple of films.

movies <- tibble::tribble(
  ~movie,                ~yr_released,   ~box_office,
  "The Lion King",           1994L,        968.6,
  "Up",                      2009L,        735.1,
  "Finding Nemo",            2003L,        940.3, 
  "Return of the Jedi",      1983L,        475.3, 
  "Raiders of the Lost Ark", 1981L,        389.9,
  "The Matrix",              1999L,        465.3,
  "Star Wars",               1977L,        775.5, 
  "Avengers: Endgame",       2019L,       2797.8, 
  "Iron Man",                2008L,        585.8,
  "The Notebook",            2004L,        117.8,
  "Guardians of the Galaxy", 2014L,        773.4
)

publishers <- tibble::tribble(
  ~movie,                     ~studio,
  "The Lion King",            "Walt Disney Pictures",
  "Up",                       "Walt Disney Pictures",
  "Finding Nemo",             "Walt Disney Pictures",
  "Return of the Jedi",       "Lucasfilm",
  "Raiders of the Lost Ark",  "Lucasfilm",
  "Star Wars",                "Lucasfilm",
  "Avengers: Endgame",        "Mavel Studios",
  "Iron Man",                 "Mavel Studios",
  "Guardians of the Galaxy",  "Mavel Studios"
)

5.9.1 mutating joins

There are several ways to add columns from one dataset to another, matching rows based on common identifiers (keys). As described in ?'mutate-joins':

  • inner_join(): includes all rows in x (movies) and y (publishers).

  • left_join(): includes all rows in x (movies).

  • right_join(): includes all rows in y (publishers).

  • full_join(): includes all rows in x (movies) or y (publishers).

Let’s apply these functions to join the movies and publishers datasets. Remember that the output of whatever comes before %>% is used as the first argument (unless explicitly defined otherwise) of what comes after it. In other words, x %>% inner_join(y) is identical to inner_join(x, y).

  # inner_join: only rows that appear in both datasets
movies %>% inner_join(publishers)
Joining with `by = join_by(movie)`
# A tibble: 9 × 4
  movie                   yr_released box_office studio              
  <chr>                         <int>      <dbl> <chr>               
1 The Lion King                  1994       969. Walt Disney Pictures
2 Up                             2009       735. Walt Disney Pictures
3 Finding Nemo                   2003       940. Walt Disney Pictures
4 Return of the Jedi             1983       475. Lucasfilm           
5 Raiders of the Lost Ark        1981       390. Lucasfilm           
6 Star Wars                      1977       776. Lucasfilm           
7 Avengers: Endgame              2019      2798. Mavel Studios       
8 Iron Man                       2008       586. Mavel Studios       
9 Guardians of the Galaxy        2014       773. Mavel Studios       

  # left_join: add information to all rows of movies
movies %>% left_join(publishers)
Joining with `by = join_by(movie)`
# A tibble: 11 × 4
   movie                   yr_released box_office studio              
   <chr>                         <int>      <dbl> <chr>               
 1 The Lion King                  1994       969. Walt Disney Pictures
 2 Up                             2009       735. Walt Disney Pictures
 3 Finding Nemo                   2003       940. Walt Disney Pictures
 4 Return of the Jedi             1983       475. Lucasfilm           
 5 Raiders of the Lost Ark        1981       390. Lucasfilm           
 6 The Matrix                     1999       465. <NA>                
 7 Star Wars                      1977       776. Lucasfilm           
 8 Avengers: Endgame              2019      2798. Mavel Studios       
 9 Iron Man                       2008       586. Mavel Studios       
10 The Notebook                   2004       118. <NA>                
11 Guardians of the Galaxy        2014       773. Mavel Studios       

  # right_join: add information to all rows of publishers
movies %>% right_join(publishers)
Joining with `by = join_by(movie)`
# A tibble: 9 × 4
  movie                   yr_released box_office studio              
  <chr>                         <int>      <dbl> <chr>               
1 The Lion King                  1994       969. Walt Disney Pictures
2 Up                             2009       735. Walt Disney Pictures
3 Finding Nemo                   2003       940. Walt Disney Pictures
4 Return of the Jedi             1983       475. Lucasfilm           
5 Raiders of the Lost Ark        1981       390. Lucasfilm           
6 Star Wars                      1977       776. Lucasfilm           
7 Avengers: Endgame              2019      2798. Mavel Studios       
8 Iron Man                       2008       586. Mavel Studios       
9 Guardians of the Galaxy        2014       773. Mavel Studios       

  # full_join: combine all information of both datasets
movies %>% full_join(publishers)
Joining with `by = join_by(movie)`
# A tibble: 11 × 4
   movie                   yr_released box_office studio              
   <chr>                         <int>      <dbl> <chr>               
 1 The Lion King                  1994       969. Walt Disney Pictures
 2 Up                             2009       735. Walt Disney Pictures
 3 Finding Nemo                   2003       940. Walt Disney Pictures
 4 Return of the Jedi             1983       475. Lucasfilm           
 5 Raiders of the Lost Ark        1981       390. Lucasfilm           
 6 The Matrix                     1999       465. <NA>                
 7 Star Wars                      1977       776. Lucasfilm           
 8 Avengers: Endgame              2019      2798. Mavel Studios       
 9 Iron Man                       2008       586. Mavel Studios       
10 The Notebook                   2004       118. <NA>                
11 Guardians of the Galaxy        2014       773. Mavel Studios       

To make sure the logic behind these joins can sink in, we’ll discuss each of the examples separately. First of all, each of the resulting joins is preceded by the message Joining with `by = join_by(movie)`, meaning that the movie variable is used as the join key (i.e. the common variable between the tables used to join them together).

  • inner_join(): The resulting dataset includes only rows that appear in both datasets. Because The Matrix and The Notebook aren’t part of publishers, they do not appear in the joined data.

  • left_join(): We start with all rows in movies (the ‘first’ or left table) and add all rows from publishers. Because The Matrix and The Notebook aren’t part of publishers, the studio variable shows NA for these rows. All information of the first (left) table is retained, and data is added wherever there are matching records in the second (right) table.

  • right_join(): This result is identical to what we’ve obtained using inner_join(), but for very different reasons. While inner_join() joins data using matching rows (based on keys, i.e. the movie column), right_join() keeps all information in the ‘second’ (i.e. right) table (publishers), and adds information from the first (movies). Because The Matrix and The Notebook aren’t part of publishers, they do not appear in the joined data. Think of right_join() as the mirror function of left_join() (attaching data from left to right, rather than from right to left).

  • full_join(): All rows from movies and publishers are returned, matched based on the movie column. The resulting dataset is identical to what we’ve obtained in left_join(), but for very different reasons. Namely, left_join starts with all rows in the first dataset (movies), and adds all matching rows from the second (publishers). At the same time, full_join returns all rows from both datasets, even if there are non-matched rows (yielding NAs).

Furthermore, all *_join() functions automatically search for common variables in both datasets and use all of them as keys. If you want to choose explicitly which variables to join on, use the by argument.

  # one variable
movies %>% inner_join(publishers, by = "movie")
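
Since dplyr 1.1.0 (the version that prints the join_by() messages shown above), the key can also be spelled with join_by(); a minimal sketch with invented stand-in tables (movies2 and pubs2 are my own):

```r
library(dplyr)

# small stand-ins for the movies/publishers tables
movies2 <- tibble::tibble(movie = c("Up", "The Matrix"),
                          yr_released = c(2009L, 1999L))
pubs2 <- tibble::tibble(movie = "Up",
                        studio = "Walt Disney Pictures")

# equivalent to by = "movie", but silences the joining message
joined <- movies2 %>% inner_join(pubs2, by = join_by(movie))
```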

All in all, left_join() is likely to be your bread and butter for merging datasets. As stated in R4DS, it is the most commonly used join because it preserves the original observations even when there isn’t a match. As such, it is recommended to be your default join, unless you have a strong reason to prefer one of the others.

5.9.2 filtering joins

Joining can also be very beneficial for filtering datasets. For instance, you may be dealing with one dataset containing measurement data and IDs, while another dataset contains IDs and geographical coordinates. However, you may only want to keep measurement data for which you have matching coordinates (and omit data that lacks this metadata). With many possible use cases, we simply cannot pass up on showing semi_join() and anti_join(). Note that, compared to mutating joins, these functions do not add variables from one dataset to another. They simply retain (or omit) rows in one dataframe based on matches in the other.

Imagine we want to retain all rows in movies for which we have matching information in publishers. This returns the original movies object, but without The Matrix and The Notebook (as they are not part of publishers).

movies %>% semi_join(publishers)
Joining with `by = join_by(movie)`
# A tibble: 9 × 3
  movie                   yr_released box_office
  <chr>                         <int>      <dbl>
1 The Lion King                  1994       969.
2 Up                             2009       735.
3 Finding Nemo                   2003       940.
4 Return of the Jedi             1983       475.
5 Raiders of the Lost Ark        1981       390.
6 Star Wars                      1977       776.
7 Avengers: Endgame              2019      2798.
8 Iron Man                       2008       586.
9 Guardians of the Galaxy        2014       773.

In other words, semi_join() keeps all rows in the first dataset (movies) that have matching rows (based on the common variable movie) in the second dataset (publishers), but no columns are added (e.g. publishers$studio). Conversely, anti_join() does the opposite.

movies %>% anti_join(publishers)
Joining with `by = join_by(movie)`
# A tibble: 2 × 3
  movie        yr_released box_office
  <chr>              <int>      <dbl>
1 The Matrix          1999       465.
2 The Notebook        2004       118.

Indeed, The Matrix and The Notebook do not appear in publishers, so these are the only rows from movies that are returned.

5.9.3 joining with non-identical keys

Sometimes, datasets may share identical keys but with different column names. As R is very strict in terms of accessing named objects, it is crucial to know how to deal with names that are not identical. To do so, we will slightly alter the movies dataset created earlier, using another of dplyr’s functions: rename().

movies_edit <- movies %>% rename("MOVIE" = movie)

If we, then, try to join these datasets, an error message will appear.

movies_edit %>% left_join(publishers)
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

The error message clearly states that these datasets do not have any common variables, so the operation is halted. To overcome problems such as these, we explicitly specify which columns to join on using the by argument.

movies_edit %>% left_join(publishers, by = c("MOVIE" = "movie"))
# A tibble: 11 × 4
   MOVIE                   yr_released box_office studio              
   <chr>                         <int>      <dbl> <chr>               
 1 The Lion King                  1994       969. Walt Disney Pictures
 2 Up                             2009       735. Walt Disney Pictures
 3 Finding Nemo                   2003       940. Walt Disney Pictures
 4 Return of the Jedi             1983       475. Lucasfilm           
 5 Raiders of the Lost Ark        1981       390. Lucasfilm           
 6 The Matrix                     1999       465. <NA>                
 7 Star Wars                      1977       776. Lucasfilm           
 8 Avengers: Endgame              2019      2798. Mavel Studios       
 9 Iron Man                       2008       586. Mavel Studios       
10 The Notebook                   2004       118. <NA>                
11 Guardians of the Galaxy        2014       773. Mavel Studios       
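
If you are on dplyr >= 1.1.0, join_by() offers an arguably more readable spelling of the same mapping, pairing the two differently named key columns with ==; a minimal sketch with invented stand-in tables:

```r
library(dplyr)

# stand-ins with differently named key columns
left_tbl  <- tibble::tibble(MOVIE = c("Up", "The Matrix"))
right_tbl <- tibble::tibble(movie = "Up",
                            studio = "Walt Disney Pictures")

# == inside join_by() pairs the two key columns
joined <- left_tbl %>% left_join(right_tbl, by = join_by(MOVIE == movie))
```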

5.10 Overview

Below is a brief overview of the dplyr functions covered in 5. Data manipulation with dplyr.

  • select() subsets columns (and optionally renames) using their names

  • filter() subsets rows using column values

  • arrange() reorders rows by column values

  • mutate() adds new variables that are functions of existing variables

  • group_by() applies grouping by one (or more) variables (this doesn’t change how the data looks, but changes how it acts with other dplyr verbs)

  • ungroup() complements group_by() by removing a layer of grouping

  • summarise()/summarize() reduces multiple values down to a single summary (removes a layer of grouping)

  • slice() and its derivatives (slice_head(), slice_tail(), slice_max(), slice_min(), slice_sample()) subset rows using their positions

  • distinct() selects only unique/distinct rows from a data frame

  • mutating joins add columns from dataset y to dataset x (see ?‘mutate-joins’ for more information)

    • inner_join(), left_join(), right_join(), full_join()
  • filtering joins filter rows from x based on the presence/absence of matches in y (see ?‘filter-joins’ for more information)

    • semi_join(), anti_join()

6. Tidy data with tidyr

tidyr (pronounced tidy-r) is designed to create tidy data. While data is often organised in a way that facilitates entry and reporting, such data structures are usually not straightforward to process in R. For instance, your raw data may be formatted as a wide table, while you want this dataset to be formatted as a long table. The long format, in particular, is usually preferred when working with tidyverse. As most built-in R functions work with vectors of values, it is only natural that tidyverse would follow suit.

As was the case for dplyr, we cannot cover all tidyr functions within this course. Instead, we will focus on what many consider to be the most common/important functions for pivoting: pivot_longer() and pivot_wider().

6.1 Pivoting

In many cases, you may face a very ‘wide’ table. Such tables often contain information scattered across multiple columns, in which the names of the columns are actually values of a variable. The opposite of a wide table is a ‘long’ table, in which the components of a single observation may be spread across multiple rows. To transform tables from wide to long and back, we can use pivot_longer() and pivot_wider().

Because creating exemplary datasets to show the power of tidyr pivoting is rather cumbersome, I’ll be using slightly reworded examples as provided in the tidyr pivoting vignette. There are even more advanced ways to deal with very untidy data than what I can show in this course (again, make sure to check out the vignette!), though the logic and approach in the code blocks below should already get you most of the way.

6.1.1 pivot_longer()

The tidyr package provides a couple of example datasets to practice our pivoting skills. One such dataset is billboard, which shows a ranking of songs for the Billboard top 100 in the year 2000. Ready for a hit of nostalgia?

head(billboard)
# A tibble: 6 × 79
  artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
  <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac       Baby… 2000-02-26      87    82    72    77    87    94    99    NA
2 2Ge+her     The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
3 3 Doors Do… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
4 3 Doors Do… Loser 2000-10-21      76    76    72    69    67    65    55    59
5 504 Boyz    Wobb… 2000-04-15      57    34    25    17    17    31    36    49
6 98^0        Give… 2000-08-19      51    39    34    26    26    19     2     2
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
#   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

Because billboard is a tibble, head() nicely shows the first rows of the dataset (compare with head(as.data.frame(billboard))). The dataset contains columns for the song artist and the track name, as well as the date on which the song entered the Billboard top 100 (date.entered). Furthermore, it shows a column for each consecutive week (columns with the prefix wk) the song was part of the list, along with the position (or rank) of the song within said list. We can probably agree that this way of showing data is highly inconvenient for analysis.

A better way of displaying this dataset is by condensing the week and rank variables to their own columns; a perfect task for pivot_longer()! To make this function do the heavy lifting, we need three parameters:

  • The columns whose names are actually values, rather than real variables. In billboard, those are all the wk* columns. In pivot_longer(), this info (i.e. all columns to pivot into longer format) is passed on to the cols argument.

  • The wk* columns contain two pieces of information. The first piece is the week in which the song held a given position (wk1, wk2, …). We are moving this week information (the column names) to a single variable and giving it a name: week. The name we want to give this column is passed on to the names_to argument (“send the names of all columns in cols to …”).

  • The second piece of information is the position (rank) the song occupied in the top 100 during that week. This column also requires a name, so let’s call it rank. This info (the values contained within each row of the wk* columns) is passed on to the values_to argument (“send all values of all columns in cols to …”).

billboard %>% 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank"
  )
# A tibble: 24,092 × 5
   artist track                   date.entered week   rank
   <chr>  <chr>                   <date>       <chr> <dbl>
 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
 7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
 8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
 9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
# ℹ 24,082 more rows

The resulting dataframe contains the artist, track, and date.entered columns, along with the pivoted columns week and rank. Great!

As cool (and tidy!) as the resulting dataset is, we can specify additional arguments in pivot_longer() to make the resulting dataset even cleaner. Below, we are adding names_prefix, names_transform, and values_drop_na (see ?pivot_longer for more information).

billboard %>% 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    names_prefix = "wk",
    names_transform = list(week = as.integer),
    values_to = "rank",
    values_drop_na = TRUE,
  )
# A tibble: 5,307 × 5
   artist  track                   date.entered  week  rank
   <chr>   <chr>                   <date>       <int> <dbl>
 1 2 Pac   Baby Don't Cry (Keep... 2000-02-26       1    87
 2 2 Pac   Baby Don't Cry (Keep... 2000-02-26       2    82
 3 2 Pac   Baby Don't Cry (Keep... 2000-02-26       3    72
 4 2 Pac   Baby Don't Cry (Keep... 2000-02-26       4    77
 5 2 Pac   Baby Don't Cry (Keep... 2000-02-26       5    87
 6 2 Pac   Baby Don't Cry (Keep... 2000-02-26       6    94
 7 2 Pac   Baby Don't Cry (Keep... 2000-02-26       7    99
 8 2Ge+her The Hardest Part Of ... 2000-09-02       1    91
 9 2Ge+her The Hardest Part Of ... 2000-09-02       2    87
10 2Ge+her The Hardest Part Of ... 2000-09-02       3    92
# ℹ 5,297 more rows

Compared to the previous code block, names_prefix removes the matching text from the start of each variable name, while names_transform changes the datatype of the resulting week column to integer (otherwise it would be a character column). The original (wide) table also contained many NA values (try view(billboard) and scroll through the dataset to see for yourself), because not every song appeared in the top 100 for more than 50 consecutive weeks (only 4 songs made it that far!). After pivoting, these NAs are removed with values_drop_na = TRUE.

6.1.2 pivot_wider()

While pivot_longer() decreases the number of variables and increases the number of rows, pivot_wider() does the exact opposite. Even though wide tables aren’t always very tidy, they do have their own use cases.

First of all, let’s try to reverse the edited billboard dataset (see 6.1.1 pivot_longer()).

  # simple pivot_longer()
billboard_long <- billboard %>% 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank"
  )

  # bring long back to wide
billboard_long %>% 
  pivot_wider(
    names_from = "week", 
    values_from = "rank"
  )
# A tibble: 317 × 79
   artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
   <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
 2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
 3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
 4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
 5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
 6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
 7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
 8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
 9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
# ℹ 307 more rows
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

As you can see, the logic behind these functions is largely interchangeable: while pivot_longer() moves variable names_to one single column, pivot_wider() moves names_from a single variable to multiple columns (the same reasoning applies to values_to and values_from). Some slightly more advanced pivoting is shown below. Note that this round trip is not perfectly lossless: because values_drop_na = TRUE removed all NA rows, week columns that contained only NAs are not recreated (68 columns instead of the original 79).

  # more advanced pivot_longer()
billboard_long <- billboard %>% 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    names_prefix = "wk",
    names_transform = list(week = as.integer),
    values_to = "rank",
    values_drop_na = TRUE,
  )

  # bring long back to wide
billboard_long %>% 
  pivot_wider(
    names_from = "week", 
    values_from = "rank",
    names_prefix = "wk", 
    values_fill = NA
  )
# A tibble: 317 × 68
   artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
   <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
 2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
 3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
 4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
 5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
 6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
 7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
 8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
 9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
# ℹ 307 more rows
# ℹ 57 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

6.1.3 An additional note on pivoting

Hopefully, you’ll never need the information in this sub-section. However, from experience, data sometimes doesn’t lend itself to be pivoted properly. To show you what this could look like, I will generate a very small dataset. Imagine that, for whatever reason, your dataset is structured as follows:

my_data <- tibble(
  plot = rep(c(LETTERS[1:3]), 2), 
  num = seq_along(plot)
)

my_data
# A tibble: 6 × 2
  plot    num
  <chr> <int>
1 A         1
2 B         2
3 C         3
4 A         4
5 B         5
6 C         6

Let’s try and pivot this to a wide table using pivot_wider().

my_data %>% 
  pivot_wider(
    names_from = plot, 
    values_from = num
  )
Warning: Values from `num` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(plot) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)
# A tibble: 1 × 3
  A         B         C        
  <list>    <list>    <list>   
1 <int [2]> <int [2]> <int [2]>

The resulting dataset contains three columns (one for each of the plots in my_data), each a list-column containing 2 values. The reason the result is structured this way is already hinted at by the warning message: Values are not uniquely identified; output will contain list-cols. In other words, the function could not identify which row each of the num values belongs to. Because the function wants the data to go somewhere, they are coerced into a single row, using lists.
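As the warning itself suggests, an alternative is to pass a summary function via values_fn, collapsing the duplicates instead of restructuring the data. A quick sketch, assuming we are happy to sum the duplicated values:

```r
# collapse duplicate plot entries with a summary function instead of list-cols
my_data %>% 
  pivot_wider(
    names_from = plot, 
    values_from = num, 
    values_fn = sum   # e.g. column A becomes 1 + 4 = 5
  )
```

When summarising is not appropriate, uniquely identifying each row (as done next) preserves every value instead.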

To ensure proper widening of the table (i.e. without creating list-columns), we need to create an additional variable that uniquely identifies each row. To do so, we need to group_by(plot) and create an ID for each row of observations. Because each plot has 2 observations, the ID column will contain ‘1’ and ‘2’. Finally, we can pivot tables to our heart’s content (but don’t forget to ungroup!).

my_data %>% 
  group_by(plot) %>% 
  mutate(ID = row_number()) %>% 
  ungroup()  %>% 
  pivot_wider(
    names_from = plot, 
    values_from = num,
  )
# A tibble: 2 × 4
     ID     A     B     C
  <int> <int> <int> <int>
1     1     1     2     3
2     2     4     5     6

7. DataViz: ggplot2

Humans are very visually oriented beings. We are naturally drawn to colours and figures, more so than to tables filled with numbers and symbols. We’re also able to ingest a lot of information quickly from a picture or graph, provided it is constructed logically and is visually pleasing. And as much as we can develop our data manipulation skills, we hardly get to reap the fruits of our labour unless we can pour these wrangled data into neat-looking visualizations. Within the tidyverse, ggplot2 is the package for creating graphics.

ggplot2 will probably be familiar to most of you. In case you have already used ggplot2 extensively, there will be little new for you in this section. However, if you’re new to the package, want to refresh your skills, or have been using it without actually understanding how it works, then tag along!

7.1 Your first plot

Let’s return to the world of Star Wars for one brief moment. Remember that we have information on character names, height, mass, etc. Perhaps we can find a relationship (or the lack thereof) between height and mass, which we can quickly explore using a visualization.

To do so, ggplot2 requires at least 2 basic building blocks. For the first one, we need to tell R we are initializing a ggplot object, using ggplot().

  # initialize ggplot object with starwars data
ggplot(data = starwars)

  # same, but written using %>%
starwars %>% 
  ggplot()

If the stars aligned, then you have now created a plot that is emptier than the vast expanse of space. Indeed, you’ve told R you are creating a ggplot object, but it has not received any instructions as to what you want to display. Other than initializing a ggplot object, we also need to define what and how we want to visualize our data.

To do so, we add another layer: a geometric object (geom). These so-called ‘geoms’ are the actual marks/data points/… you see in a plot. If you want to create a plot with lines, geom_line() is your best friend. If you want to create a histogram or boxplot, then geom_histogram() and geom_boxplot() have got your back. See here for a full list of all possible geoms within ggplot2 (bear in mind, there are many other possibilities via ggplot2 add-ons!).

For this example, we’ll create a scatter plot using geom_point(). Adding new layers to a ggplot (in which a new layer is layered upon the previous one, much akin to working in image editing software!) is done using the + operator (rather than using %>%, much to one of the developer’s frustration).

starwars %>% 
  ggplot() +
  geom_point()
Error in `geom_point()`:
! Problem while setting up geom.
ℹ Error occurred in the 1st layer.
Caused by error in `compute_geom_1()`:
! `geom_point()` requires the following missing aesthetics: x and y

Oof, a wild error message appeared! Even though we defined how we wanted to present our data (point shapes; scatter plot), we forgot to define what we want to visualize. To do so, we need to tell ggplot we are mapping height to the x-axis, and mass to the y-axis. This can be done using aes (short for ‘aesthetics’). Such aesthetic mappings describe how variables (in your data) are mapped to certain visual properties (aesthetics) of the geoms you’ve selected.

starwars %>% 
  ggplot() +
  geom_point(
    mapping = aes(x = height, y = mass)
  )
Warning: Removed 28 rows containing missing values (`geom_point()`).

Very nice! We have now created a very basic scatter plot, and it already suggests a linear relationship between height and mass, apart from (what appears to be) a massive outlier. We also obtained a warning message (Removed 28 rows containing missing values (geom_point).), which means some rows did not contain data that could be visualized (e.g. NAs).
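As a quick aside: once the aesthetic mappings are in place, swapping the geom is often all it takes to change the plot type. For instance, still using starwars:

```r
library(dplyr)    # provides starwars and %>%
library(ggplot2)

# distribution of character heights as a histogram
starwars %>% 
  ggplot(aes(x = height)) +
  geom_histogram()

# height per sex as boxplots
starwars %>% 
  ggplot(aes(x = sex, y = height)) +
  geom_boxplot()
```

Both calls reuse the same structure as our scatter plot; only the geom (and the mappings it requires) changes.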

7.2 Building blocks of a ggplot

Even though ggplot has taken care of a lot of details to create the scatter plot above, there is a lot we can do to make the plot look better. The building blocks of a ggplot are shown below, distinguishing between the required and non-required parts. The non-required parts will be filled in by ggplot itself, using default settings to make your life easier.

  # REQUIRED
ggplot() +
  geom_function() +
  
  # NOT REQUIRED
  coordinate_function() +
  facet_function() +
  scale_function() +
  theme_function()

You may notice that we’ve already covered the required blocks (albeit very briefly), and will now explore ways to flavour a new plot to our personal taste.

7.3 Data preparation

Let’s move away from Star Wars and into a fresh, new dataset. We’ll start by loading ‘Animal Rescues’, supplied by London.gov and shared by the TidyTuesday community. Now we are all on the same playing field, as this dataset is completely new to me! We will prepare this dataset for visualization, and build our figure step-by-step.

animal_rescues_raw <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-29/animal_rescues.csv')

It’s important to know the data we’re working with, so feel free to explore the contents of the dataset. Fortunately, the TidyTuesday readme provides us with context and descriptions of each column. From this file, we know this dataset contains information on animal (animal_group_parent) rescues performed by the London Fire Brigade. Additionally, it contains monthly-updated data going back to 2009 (cal_year), as well as the total cost of each incident (incident_notional_cost).

One thing’s for certain: we can do many things with this dataset. For the purpose of this course, I want to visualize the number of animals rescued in the last couple of years. Let’s do some very quick data cleaning to keep only data we want to visualize. Most of these steps have been covered in the past sections, with some new elements added to keep things interesting!

  # prepare animal_rescues for dataviz
animal_rescues <- animal_rescues_raw %>% 
    # select and rename only columns of interest
  select(
    "animal" = animal_group_parent, 
    "year" = cal_year, 
    "cost" = incident_notional_cost
    ) %>% 
    # keep only data from 2016 onwards
  filter(year >= 2016) %>% 
    # drop incidents where the animal is unknown
  filter(!str_detect(animal, "Unknown")) %>% 
    # lowercase animal names (preventing separate counts of e.g. Cat vs cat)
    # and make sure cost is numeric
  mutate(
    across(animal, ~tolower(.x)), 
    across(cost, ~as.numeric(.x))
    ) %>% 
    # keep only animals with at least 5 incidents
  group_by(animal) %>% 
  filter(n() >= 5) %>% 
  ungroup() 
  
glimpse(animal_rescues)
Rows: 3,305
Columns: 3
$ animal <chr> "cat", "bird", "dog", "dog", "dog", "cat", "cat", "bird", "dog"
$ year   <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 201…
$ cost   <dbl> 298, 298, 298, 298, 298, 298, 298, 298, 298, 298, 298, 298, 119…

7.4 Building the plot

Now we’ve prepared our data, we can start building the figure. As before, we’ll start with initializing the ggplot object, choosing a geom and mapping our variables of interest to the aesthetics of the plot. Given the data, I feel like a bar plot sounds reasonable!

To create a bar plot, we have two options: geom_bar() and geom_col(). The first will create a bar plot, in which each bar is assigned a height proportional to the number of cases in each group. In other words, geom_bar() will automatically calculate how many observations you have in each group (much like dplyr::count on grouped data), and display these numbers accordingly. As we didn’t define any factor levels to the animal column, ggplot automatically orders the variable alphabetically.

ggplot() +
  geom_bar(data = animal_rescues, mapping = aes(x = animal))

Alternatively, if you want to create a bar plot in which the bar height represents actual values in your data (here, cost), geom_col() is used, as shown below. That said, we will continue with geom_bar() afterwards!

ggplot() +
  geom_col(data = animal_rescues, mapping = aes(x = animal, y = cost))
Warning: Removed 24 rows containing missing values (`position_stack()`).

7.4.1 Coordinate systems

One particular feature of bar plots is how axis labels can quickly become unreadable. For this purpose, one can easily flip a bar plot sideways using one of the ggplot coordinate systems, coord_flip(), which we add as another layer to the ggplot call.

ggplot() +
  geom_bar(data = animal_rescues, mapping = aes(x = animal)) +
  coord_flip()

7.4.2 Faceting

Currently, our plots show a summary of animal incidents across all years available in our data. We could also be interested in splitting our data according to the year in which the incident took place. This can easily be done using one of the facet_*() functions: facet_wrap() and facet_grid(). We will choose the former, because we only have 1 discrete variable (year) along which we want to visualize the animal counts.

To use facet_wrap(), we need to define which variable to use for faceting. This can be done either with vars(), a one-sided formula ~variable or a character vector c("variable"); I personally prefer the ~ syntax.

ggplot() +
  geom_bar(data = animal_rescues, mapping = aes(x = animal)) +
  coord_flip() +
  facet_wrap(~year)
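The three faceting syntaxes are indeed equivalent; a small sketch reusing the plot above (assuming animal_rescues from section 7.3 is loaded):

```r
library(ggplot2)

# the base plot, stored so we can add different facet layers to it
p <- ggplot(animal_rescues, aes(x = animal)) +
  geom_bar() +
  coord_flip()

p + facet_wrap(~year)        # one-sided formula
p + facet_wrap(vars(year))   # vars() helper
p + facet_wrap("year")       # character vector
```

All three produce the same faceted plot, so the choice is purely a matter of style.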

7.4.3 Scales

Scales are the functions that control how your data are translated into visual properties. Even though we mapped the animal variable to the x-axis (and then flipped it using coord_flip()), we didn’t give any additional information on colouring and filling these bars. The easy way to solve this is by telling ggplot what we want to colour, without even telling it which colours.

Let’s tell ggplot we want to fill our bars with a colour according to animal, and give each bar a black outline. For the colour filling, we use fill inside aes(), while the black outlines of each bar are defined outside of aes().

ggplot() +
  geom_bar(
    data = animal_rescues, 
    mapping = aes(x = animal, fill = animal), colour = "black"
    ) +
  coord_flip() +
  facet_wrap(~year)

To clarify, whenever we want to change colour/shape/size/… according to a variable (i.e. mapping a variable to a visual property), we need to add this argument inside of aes(). In all cases where we want to create a visual property across the entire geom (disregarding any groups in our data), we add the argument outside of the aes() call. Notice how ggplot also adds a legend (automatically) to the side of the plot for each visual property inside of aes().
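A minimal contrast of the two cases:

```r
library(ggplot2)

# fill mapped to a variable (inside aes): one colour per animal, plus a legend
ggplot(animal_rescues, aes(x = animal, fill = animal)) +
  geom_bar()

# fill set to a constant (outside aes): every bar the same colour, no legend
ggplot(animal_rescues, aes(x = animal)) +
  geom_bar(fill = "steelblue")
```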

Now we’ve mapped animal to the fill aesthetic, ggplot knows that it needs to apply a colour fill to each distinct animal in our bar plot. But we can take the customization even further by specifying which colours, using one of the scale_*_*() functions. The first * in scale_*_*() is a placeholder for the visual property (fill, shape, size, …) you want to customize, while the second * is a placeholder for the type of scaling you want to apply (using a pre-built colour palette, or building your own). In this case, we want to apply colouring to the fill aesthetic, and we will use one of the palettes provided by ColorBrewer (see <https://colorbrewer2.org> for more information): scale_fill_brewer(palette = "Set3").

ggplot() +
  geom_bar(
    data = animal_rescues, 
    mapping = aes(x = animal, fill = animal), colour = "black"
    ) +
  coord_flip() +
  facet_wrap(~year) +
  scale_fill_brewer(palette = "Set3")

7.4.4 Themes

Themes are used to polish the appearance of a plot. If you don’t want to tinker too much with too many details, ggplot2 offers a series of complete themes. Applying a theme to your plot is as simple as adding a theme_*() layer to your plot. I personally tend to go for either theme_classic() or theme_bw().

ggplot() +
  geom_bar(
    data = animal_rescues, 
    mapping = aes(x = animal, fill = animal), colour = "black"
    ) +
  coord_flip() +
  facet_wrap(~year) +
  scale_fill_brewer(palette = "Set3") +
  theme_classic()

7.4.5 Final edits

And there you go! We’ve created a plot that is much more pleasing than the figure we built using only geom_*(), and with only a little bit of extra effort! Using the building blocks discussed so far will get you 90% of the way to create juicy DataViz, but the remaining 10% is still up for grabs… All aboard the perfectionism bandwagon! Bear in mind that some of these steps will be more advanced than what we’ve discussed so far. Knowing that these possibilities exist is, therefore, more important than knowing how-to.

There are four aspects of the plot that I’d like to improve. First of all, some animals hardly have any incidents recorded in some years. At this point, it would make more sense to lump these animals under an ‘other’ category. Secondly, the ordering of the bars isn’t very visually pleasing. For part three, I want to add numbers to each bar showing how many incidents occurred for each animal. The last aspect involves minor details that, as a whole, should improve the plot’s readability and general ‘look’. Along with these, I’m going to slightly rewrite the ggplot code block by moving all data-related aesthetic mappings to the ggplot() call (as well as removing the names of the arguments), and remap animal to the y-axis, making coord_flip() redundant. The rewritten (but not yet finalized code) looks as follows:

animal_rescues %>% 
  ggplot(aes(y = animal, fill = animal)) +
  geom_bar(colour = "black") +
  facet_wrap(~year) +
  scale_fill_brewer(palette = "Set3") +
  theme_classic()

First of all, let’s lump the three species that appear the least in our data. This can be done using functions from the forcats package: a series of helpers for manipulating factors. Even though we didn’t cover this package before, it is more important to know that they exist and that they can easily be Googled (admittedly, I don’t often work with these functions, so I always end up browsing the Internet for help).

This particular operation can be achieved using the fct_lump() family of functions. Because we want to keep only the groups that appear most frequently, we will use fct_lump_n() to lump everything but the 7 most frequent animals.

While we’re at it, we’ll reorder animal according to the number of incidents for each animal (across all years in our dataset). I will use fct_infreq() to reorder the levels of animal by the number of observations within each level. Additionally, I will reverse the order of fct_infreq() to show the bars in descending order using fct_rev(). As a side remark: I will always try to wrangle my data before throwing it into ggplot to increase readability and predictability of my code. I also tend to break up mutations into different lines of code for the same reasons. Disclaimer: in case the data wrangling is computationally heavy, I will store the pre-wrangled data in another object before initializing a ggplot (so R doesn’t need to calculate the same data over and over again).

animal_rescues %>% 
  mutate(
      # lump least frequent groups
    animal = fct_lump_n(animal, n = 7),
      # reorder variable according to number of observations 
    animal = fct_infreq(tolower(animal)), 
      # reverse ordering to show bars in descending order
    animal = fct_rev(animal)
    ) %>% 
  ggplot(aes(y = animal, fill = animal)) +
  geom_bar(colour = "black") +
  facet_wrap(~year) +
  scale_fill_brewer(palette = "Set3", direction = -1) +
  theme_classic()

Then, we can add text labels (numbers) to the plot indicating how many incidents occurred for each animal. To do so, we add a new geometric object: geom_text(). Because our dataset does not contain summarized count data for each animal (per year), we need to tell geom_text() to calculate these numbers on the fly, hence stat = "count" (which is what geom_bar() does by default). This way, ggplot knows where to put the text, but not yet what text to use. Since we already defined y in ggplot(), all subsequent geoms ‘inherit’ this information, so we only need to add the label argument to aes(). Given we want the count data calculated by the geom, we use the ..stat.. syntax, in this case: ..count...

No need to worry too much about the details, as this is relatively advanced material. Just know that geom_text() inherited information of the positioning of the labels from ggplot(), and that we need to tell geom_text() explicitly that it needs to count the data before plotting it.

animal_rescues %>% 
  mutate(
      # lump least frequent groups
    animal = fct_lump_n(animal, n = 7),
      # reorder variable according to number of observations 
    animal = fct_infreq(tolower(animal)), 
      # reverse ordering to show bars in descending order
    animal = fct_rev(animal)
    ) %>% 
  ggplot(aes(y = animal, fill = animal)) +
  geom_bar(colour = "black") +
  geom_text(stat = "count", aes(label = ..count..)) +
  facet_wrap(~year) +
  scale_fill_brewer(palette = "Set3", direction = -1) +
  theme_classic()
Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
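As the warning notes, the dot-dot notation has been superseded by after_stat(). The equivalent label layer in current ggplot2 syntax would be:

```r
library(ggplot2)

# modern replacement for aes(label = ..count..)
geom_text(stat = "count", aes(label = after_stat(count)))
```

Either spelling produces the same labels; after_stat() is simply the non-deprecated way to refer to variables computed by the stat.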

Finally, I want to customize the general appearance of the plot, such as modifying the appearance of text, removing the legend (as it doesn’t add much information in this figure), etc. To do so, I will add labs() to edit the content of text in the plot, change some specifics in geom_text(), edit the range of the x-axis using scale_x_continuous(), and add theme() for changing the general lay-out of text in the plot.

animal_rescues %>% 
  mutate(
      # lump least frequent groups
    animal = fct_lump_n(animal, n = 7),
      # reorder variable according to number of observations 
    animal = fct_infreq(tolower(animal)), 
      # reverse ordering to show bars in descending order
    animal = fct_rev(animal)
    ) %>% 
  ggplot(aes(y = animal, fill = animal)) +
  geom_bar(colour = "black") +
  geom_text(
    stat = 'count', 
    aes(label = ..count..), 
    hjust = -0.15, col = "grey50", size = 3) +
  facet_wrap(~year) +
  scale_fill_brewer(
    palette = "Set3", direction = -1) +
  scale_x_continuous(
    limits = c(0, 390), 
    expand = c(0, 0)) +
  theme_classic() +
  labs(
    x = "Number of incidents", 
    y = "", 
    title = "Animal Rescues in London", 
    subtitle = "The animals most commonly saved by the London Fire Brigade",
    caption = "
    Other: hamster, snake and rabbit | TidyTuesday data provided by London.gov | DataViz by Stijn Van de Vondel."
  ) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
    plot.subtitle = element_text(face = "italic", hjust = 0.5), 
    axis.title = element_text(face = "bold", size = 12, colour = "grey30"),
    plot.caption = element_text(size = 10, colour = "grey50", hjust = 0.5), 
    strip.background = element_rect(colour = NA), 
    strip.text = element_text(face = "bold", colour = "#660000"), 
    plot.background = element_rect(colour = NA, fill = "grey95"),
    panel.background = element_rect(colour = NA, fill = "grey95"),
    strip.background.x = element_rect(colour = NA, fill = "grey95")
  )

Customization of a figure can take a very long time. At the same time, I get a strong sense of satisfaction whenever data comes together beautifully in a figure. What ‘beautiful’ looks like is often a combination of personal style and design principles. In the end, what matters is that you convey a message to your audience clearly, concisely, and in a visually pleasing way!

7.5 Saving and exporting figures

Now that we’ve created a pretty plot, all that remains is to save it! RStudio sports an ‘Export’ button in the ‘Plots’ pane, but it only produces low-resolution image files. Fortunately, ggplot2 comes with the ggsave() function to make our lives easier.

First, we need to decide where to save the image. Depending on your set-up, you can save the image directly into your working directory (see getwd()) or specify a (sub-)folder of your working directory to save the file to. All that remains, then, is to give the file a name. I usually include the whole path in the filename argument, but you can also define filename and path separately.

We have two options for telling ggsave() which plot to save: assign the plot to an R object using the <- assignment operator, or use last_plot() to retrieve the most recently created plot (in RStudio, the one shown in the ‘Plots’ pane).

Finally, we need to specify the graphics device, size (and units of the size parameters), and resolution to create the image. While the width and height arguments determine the physical size of the figure, the resolution (dpi) is set separately; the resulting pixel dimensions are simply size × dpi (e.g., an 8 × 6 inch figure at 500 dpi yields a 4000 × 3000 pixel image). Higher dpi values generate higher-quality images at the expense of space on your (hard) drive. ggsave() defaults to dpi = 300, but I would recommend at least 500 in case you want to print your figure.

   # create plot
some_plot <- some_data %>% 
  ggplot() +
  geom_point()

  # save plot
ggsave(
  filename = "PATH/my_plot.png", plot = some_plot, 
  device = "png", units = "in", #inches
  width = 8, height = 6, dpi = 500
  )

  # save the last plot shown in your session
ggsave(
  filename = "PATH/my_plot.png", plot = last_plot(), 
  device = "png", units = "in", #inches
  width = 8, height = 6, dpi = 500
  )

There are many more arguments and specifications that can be set within ggsave(), which you can read about in the function’s documentation.

8. Life cycle

One last piece of information I wish I had known earlier is the lifecycle stages of the tidyverse. These stages indicate whether a function is still in an experimental phase, whether it is stable, or whether it has been replaced by a better alternative. They are mostly of concern when you are writing scripts that you intend to use for a long time (unless you create independent packages for these scripts). I will only discuss these stages very briefly; see the lifecycle package documentation or check vignette("stages") for more information.

[Figure: The tidyverse lifecycle stages.]

8.1 Experimental

Sometimes, functions are released in the experimental stage. This generally means that the authors are (cautiously) optimistic about the function, but are waiting for people to try it out and provide feedback. No promises are made in terms of long-term stability: the authors reserve the right to make breaking changes (i.e. changes that can break the code that uses this function) without much warning. Whenever you come across one of these functions, consider using alternatives.

For example: ?dplyr::group_split.
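As a quick illustration of this experimental function (using the built-in mtcars dataset), group_split() splits a data frame into a list of tibbles, one per group:

```r
library(dplyr)

# group_split() returns a list of tibbles, one element per group
by_cyl <- mtcars %>% group_split(cyl)

length(by_cyl)     # 3 groups: 4, 6 and 8 cylinders
nrow(by_cyl[[1]])  # 11 cars with 4 cylinders
```

Because group_split() is experimental, its behaviour (or even its name) may change in a future dplyr release, which is exactly why relying on it in long-lived scripts is risky.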

8.2 Stable

Stable functions (within the tidyverse) often do not carry a badge atop their documentation (unless the author wants to draw attention to the stability of the function). These functions come with the promise that breaking changes will be avoided whenever possible. If such a change is needed, it will only occur very gradually through the process of ‘deprecation’ (see below).

For example: ?dplyr::mutate.

8.3 Superseded

Whenever a function has a known and better alternative, it is labelled as ‘superseded’. It is still safe to use (perhaps even safer than a stable function, since it will never change) because it no longer receives new features, only critical bug fixes. No warning message is shown in the R console, but the function’s documentation will suggest alternatives.

For example: ?dplyr::mutate_at.

8.4 Deprecated

Whenever you use a function that is labelled as deprecated, a warning message is shown advising you to move to one of the suggested alternatives. These functions have been overtaken by better alternatives and, most importantly, are scheduled for removal.

For example: ?tibble::as_data_frame.
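For instance, calling the deprecated as_data_frame() emits a warning pointing you to its replacement, as_tibble():

```r
library(tibble)

# deprecated: works for now, but warns and is scheduled for removal
# tbl <- as_data_frame(mtcars)

# the suggested replacement
tbl <- as_tibble(mtcars)
```

Switching to the replacement as soon as you see the warning keeps your scripts working once the deprecated function is finally removed.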

9. Cheat sheets

If you like having a lot of information summarised in cheat sheets, then tidyverse has got you covered. See https://www.rstudio.com/resources/cheatsheets/ for cheat sheets on ggplot2, dplyr, tidyr, and more!

Session Info

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Belgium.utf8  LC_CTYPE=Dutch_Belgium.utf8   
[3] LC_MONETARY=Dutch_Belgium.utf8 LC_NUMERIC=C                  
[5] LC_TIME=Dutch_Belgium.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2  forcats_1.0.0    stringr_1.5.0    dplyr_1.1.2     
 [5] purrr_1.0.1      readr_2.1.4      tidyr_1.3.0      tibble_3.2.1    
 [9] ggplot2_3.4.2    tidyverse_2.0.0  rmdformats_1.0.4 knitr_1.42      

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0   xfun_0.38          bslib_0.4.2        colorspace_2.1-0  
 [5] vctrs_0.6.1        generics_0.1.3     htmltools_0.5.5    yaml_2.3.7        
 [9] utf8_1.2.3         rlang_1.1.0        jquerylib_0.1.4    pillar_1.9.0      
[13] glue_1.6.2         withr_2.5.0        RColorBrewer_1.1-3 bit64_4.0.5       
[17] lifecycle_1.0.3    munsell_0.5.0      gtable_0.3.3       evaluate_0.20     
[21] labeling_0.4.2     tzdb_0.3.0         fastmap_1.1.1      curl_5.0.0        
[25] parallel_4.2.3     fansi_1.0.4        highr_0.10         tufte_0.13        
[29] scales_1.2.1       cachem_1.0.7       vroom_1.6.3        jsonlite_1.8.4    
[33] farver_2.1.1       bit_4.0.5          hms_1.1.3          digest_0.6.31     
[37] stringi_1.7.12     bookdown_0.33      grid_4.2.3         cli_3.6.1         
[41] tools_4.2.3        magrittr_2.0.3     sass_0.4.5         crayon_1.5.2      
[45] pkgconfig_2.0.3    timechange_0.2.0   rmarkdown_2.21     rstudioapi_0.14   
[49] R6_2.5.1           compiler_4.2.3